Drawing graphs

Our data

  • To illustrate making graphs, we need some data.
  • Data on 202 male and female athletes at the Australian Institute of Sport.
  • Variables:
    • categorical: Sex of athlete, sport they play
    • quantitative: height (cm), weight (kg), lean body mass, red and white blood cell counts, haematocrit and haemoglobin (blood), ferritin concentration, body mass index, percent body fat.
  • Values separated by tabs (which impacts reading in).

Packages for this section

library(tidyverse)

Reading data into R

  • Use read_tsv (“tab-separated values”), like read_csv.
  • Data in ais.txt:
my_url <- "http://ritsokiguess.site/datafiles/ais.txt"
athletes <- read_tsv(my_url)

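To see how read_tsv handles tab-separated values without downloading anything, here is a minimal sketch. The three rows are made up for illustration (they are not taken from ais.txt), tiny_text and tiny are hypothetical names, and wrapping the text in I() assumes readr 2.0 or later:

```r
library(tidyverse)

# Made-up rows, typed inline; "\t" is a tab and "\n" ends a line.
tiny_text <- "Sex\tSport\tHt\tWt\nfemale\tNetball\t176.8\t59.9\nmale\tSwim\t185.0\t80.2\nfemale\tRow\t179.3\t75.2\n"

# I() marks the string as literal data rather than a file name or URL.
tiny <- read_tsv(I(tiny_text), show_col_types = FALSE)
tiny
```

Each tab starts a new column, so the result has four columns: two text and two numeric.
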
The data (some)

athletes

Types of graph

Depends on number and type of variables:

Categorical  Quantitative  Graph
1            0             bar chart
0            1             histogram
2            0             grouped bar charts
1            1             side-by-side boxplots
0            2             scatterplot
2            1             grouped boxplots
1            2             scatterplot with points identified by group (e.g. by colour)

With more (categorical) variables, might want separate plots by groups. This is called facetting in R.

ggplot

  • ggplot (from the ggplot2 package, loaded as part of the tidyverse) is the graphing function we use for all our graphs.
  • Use it in different ways to get the precise graph we want.
  • Let’s start with bar chart of the sports played by the athletes.

Bar chart

ggplot(athletes, aes(x = Sport)) + geom_bar()

Histogram of body mass index

ggplot(athletes, aes(x = BMI)) + geom_histogram(bins = 10)

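The number of bins is a choice, and it changes how the histogram looks. A small sketch, using simulated values in place of the real BMI column (sim, p_few and p_many are illustrative names, and the mean and SD are arbitrary):

```r
library(tidyverse)

set.seed(1)
# Simulated stand-in for a quantitative column; not the athletes data.
sim <- tibble(x = rnorm(200, mean = 23, sd = 3))

# Too few bins hides the shape; too many makes it jagged.
p_few <- ggplot(sim, aes(x = x)) + geom_histogram(bins = 5)
p_many <- ggplot(sim, aes(x = x)) + geom_histogram(bins = 50)
p_few
p_many
```

Something in between, like the 10 bins used above, usually shows the shape without too much noise.
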
Which sports are played by males and females?

Grouped bar chart:

ggplot(athletes, aes(x = Sport, fill = Sex)) +
  geom_bar(position = "dodge")

BMI by gender

ggplot(athletes, aes(x = Sex, y = BMI)) + geom_boxplot()

Height vs. weight

Scatterplot:

ggplot(athletes, aes(x = Ht, y = Wt)) + geom_point()

With regression line

ggplot(athletes, aes(x = Ht, y = Wt)) +
  geom_point() + geom_smooth(method = "lm")

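The line drawn by geom_smooth(method = "lm") is the least-squares line that lm would fit. A sketch on simulated data (the data frame sim and its coefficients are made up, not the athletes' values; se = FALSE, which hides the grey band, is optional):

```r
library(tidyverse)

set.seed(2)
# Simulated heights and weights with a known linear trend plus noise.
sim <- tibble(Ht = runif(50, 150, 200)) %>%
  mutate(Wt = -100 + Ht + rnorm(50, sd = 5))

# The same line, fitted directly: intercept and slope.
fit <- lm(Wt ~ Ht, data = sim)
coef(fit)

p_lm <- ggplot(sim, aes(x = Ht, y = Wt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_lm
```

The intercept and slope from coef(fit) describe the line on the plot.
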
BMI by sport and gender

ggplot(athletes, aes(x = Sport, y = BMI, fill = Sex)) +
  geom_boxplot()

A variation that uses colour instead of fill:

ggplot(athletes, aes(x = Sport, y = BMI, colour = Sex)) +
  geom_boxplot()

Height and weight by gender

ggplot(athletes, aes(x = Ht, y = Wt, colour = Sex)) +
  geom_point()

Height by weight by gender for each sport, with facets

ggplot(athletes, aes(x = Ht, y = Wt, colour = Sex)) +
  geom_point() + facet_wrap(~Sport)

Filling each facet

By default, every facet uses the same scales. To give each facet its own scales, use scales = "free":

ggplot(athletes, aes(x = Ht, y = Wt, colour = Sex)) +
  geom_point() + facet_wrap(~Sport, scales = "free")

Another view of height vs weight

ggplot(athletes, aes(x = Ht, y = Wt)) +
  geom_point() + facet_wrap(~ Sex)

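Besides "free", facet_wrap also accepts scales = "free_x" and scales = "free_y", which free only one axis and keep the other shared. A sketch with a tiny made-up data frame (sim and its values are illustrative only):

```r
library(tidyverse)

# Two made-up groups with very different heights and weights.
sim <- tibble(
  Ht = c(160, 165, 170, 190, 195, 200),
  Wt = c(50, 55, 60, 90, 95, 100),
  Sport = rep(c("Gym", "BBall"), each = 3)
)

base <- ggplot(sim, aes(x = Ht, y = Wt)) + geom_point()

# Each facet gets its own x-axis; the y-axis is shared.
p_free_x <- base + facet_wrap(~Sport, scales = "free_x")
# Each facet gets its own y-axis; the x-axis is shared.
p_free_y <- base + facet_wrap(~Sport, scales = "free_y")
p_free_x
p_free_y
```

Shared scales make facets directly comparable; free scales make better use of the space in each facet.
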
Normal quantile plot

For assessing whether a column has a normal distribution or not:

ggplot(athletes, aes(sample = BMI)) + stat_qq() + stat_qq_line()

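To see what a normal quantile plot actually shows, the points can be built by hand: sort the data, and plot each value against the normal quantile at the same position. A sketch on simulated data (d, manual and the column name theoretical are illustrative; the mean and SD are arbitrary):

```r
library(tidyverse)

set.seed(3)
d <- tibble(x = rnorm(30, mean = 23, sd = 3))

# Sorted data values paired with the standard normal quantiles
# at evenly spread probabilities (ppoints), smallest to largest.
manual <- d %>%
  arrange(x) %>%
  mutate(theoretical = qnorm(ppoints(n())))

# Hand-made version of the plot.
ggplot(manual, aes(x = theoretical, y = x)) + geom_point()

# The same points, drawn by stat_qq.
p_qq <- ggplot(d, aes(sample = x)) + stat_qq()
p_qq
```

If the data are normal, these points fall close to a straight line. By default, stat_qq_line draws the line through the first and third quartiles, so it is the extreme points that tend to drift away from it.
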
Comments

Facetting

Male and female athletes’ BMI separately:

ggplot(athletes, aes(sample = BMI)) + stat_qq() + stat_qq_line() +
  facet_wrap(~ Sex)

Comments

More normal quantile plots

Normal data, large sample

d <- tibble(x=rnorm(200))
ggplot(d, aes(x=x)) + geom_histogram(bins=10)

The normal quantile plot

ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()

Normal data, small sample

d <- tibble(x=rnorm(20))
ggplot(d, aes(x=x)) + geom_histogram(bins=5)

The normal quantile plot

Apart from the highest and lowest points being slightly off, the points follow the line. I’d call this good:

ggplot(d, aes(sample=x)) + stat_qq() + stat_qq_line()

Chi-squared data, df = 10

Somewhat skewed to right:

d <- tibble(x=rchisq(100, 10))
ggplot(d,aes(x=x)) + geom_histogram(bins=10)

The normal quantile plot

Somewhat opening-up curve:

ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()

Chi-squared data, df = 3

Definitely skewed to right:

d <- tibble(x=rchisq(100, 3))
ggplot(d, aes(x=x)) + geom_histogram(bins=10)

The normal quantile plot

Clear upward-opening curve:

ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()

t-distributed data, df = 3

Long tails (or a very sharp peak):

d <- tibble(x=rt(300, 3))
ggplot(d, aes(x=x)) + geom_histogram(bins=15)

The normal quantile plot

Low values too low and high values too high for normal.

ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()

Summary

On a normal quantile plot:

                       High points too low   High points too high
Low points too low     Skewed left           Long tails
Low points too high    Short tails           Skewed right
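
The four patterns can be seen side by side with simulated data. This is only a sketch: the choice of distributions is an assumption made for illustration (a negated chi-squared for left skew and a uniform for short tails do not appear earlier in these notes), and shapes and p_shapes are illustrative names:

```r
library(tidyverse)

set.seed(4)
n <- 200

# One simulated sample per shape, stacked into a single data frame.
shapes <- tibble(
  shape = rep(c("skewed right", "skewed left", "long tails", "short tails"),
              each = n),
  x = c(rchisq(n, 3),   # right-skewed
        -rchisq(n, 3),  # mirror image, so left-skewed
        rt(n, 3),       # long tails
        runif(n))       # short tails
)

# One normal quantile plot per shape; free scales because the ranges differ.
p_shapes <- ggplot(shapes, aes(sample = x)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~shape, scales = "free")
p_shapes
```

Compare where the lowest and highest points sit relative to the line in each facet.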